cluster validity index
Absolute indices for determining compactness, separability and number of clusters
Bagirov, Adil M., Aliguliyev, Ramiz M., Sultanova, Nargiz, Taheri, Sona
Finding "true" clusters in a data set is a challenging problem. Clustering solutions obtained using different models and algorithms do not necessarily provide compact and well-separated clusters or the optimal number of clusters. Cluster validity indices are commonly applied to identify such clusters. Nevertheless, these indices are typically relative, and they are used to compare clustering algorithms or choose the parameters of a clustering algorithm. Moreover, the success of these indices depends on the underlying data structure. This paper introduces novel absolute cluster indices to determine both the compactness and separability of clusters. We define a compactness function for each cluster and a set of neighboring points for cluster pairs. This function is utilized to determine the compactness of each cluster and the whole cluster distribution. The set of neighboring points is used to define the margin between clusters and the overall distribution margin. The proposed compactness and separability indices are applied to identify the true number of clusters. Using a number of synthetic and real-world data sets, we demonstrate the performance of these new indices and compare them with other widely-used cluster validity indices.
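The paper's exact compactness function and neighboring-point margin are defined in the text itself; as a rough illustration of the two quantities involved, the sketch below uses simplified stand-ins (mean distance to the centroid for compactness, minimum inter-cluster point distance for the margin) — these are assumptions, not the authors' definitions:

```python
import math

def compactness(cluster):
    # simplified stand-in: mean distance of points to the cluster centroid
    d = len(cluster[0])
    centroid = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    return sum(math.dist(p, centroid) for p in cluster) / len(cluster)

def margin(c1, c2):
    # simplified stand-in: smallest distance between points of the two clusters
    return min(math.dist(p, q) for p in c1 for q in c2)

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
far   = [(5.0, 5.0), (5.1, 5.0)]
# a compact, well-separated pair: compactness well below the margin
print(compactness(tight) < margin(tight, far))  # → True
```

An absolute index in this spirit can then compare each cluster's compactness against its margins to neighbors, rather than ranking one clustering against another.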
Improving internal cluster quality evaluation in noisy Gaussian mixtures
de Amorim, Renato Cordeiro, Makarenkov, Vladimir
Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground-truth labels are unavailable. However, these measures can be affected by the feature-relevance issue, potentially leading to unreliable evaluations in high-dimensional or noisy data sets. We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features. The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable. (Mila - Quebec AI Institute, Montreal, QC, Canada.) Keywords: cluster validity indices, data rescaling, noisy data.
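The abstract describes rescaling feature contributions by their dispersion but does not give the formula; the following is a hedged sketch of that idea only (inverse within-cluster dispersion as a feature weight), not the authors' exact FIR method:

```python
import math

def fir_rescale(X, labels):
    # illustrative sketch, not the paper's exact formula: weight each feature
    # by the inverse of its total within-cluster dispersion, so that noisy
    # (high-dispersion) features are attenuated before computing an index
    d = len(X[0])
    disp = [0.0] * d
    for k in set(labels):
        pts = [x for x, l in zip(X, labels) if l == k]
        centroid = [sum(p[j] for p in pts) / len(pts) for j in range(d)]
        for p in pts:
            for j in range(d):
                disp[j] += (p[j] - centroid[j]) ** 2
    w = [1.0 / (s + 1e-12) for s in disp]
    total = sum(w)
    w = [x / total for x in w]  # normalise weights to sum to 1
    rescaled = [[x[j] * math.sqrt(w[j]) for j in range(d)] for x in X]
    return rescaled, w

# feature 0 separates the clusters, feature 1 is pure noise
X = [[0.0, 3.0], [0.2, -2.0], [5.0, 2.5], [5.2, -3.0]]
labels = [0, 0, 1, 1]
Xr, w = fir_rescale(X, labels)
print(w[0] > w[1])  # → True: the informative feature gets more weight
```

Any standard internal index computed on the rescaled data would then be less distorted by the noise feature, which is the effect the paper measures.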
From A-to-Z Review of Clustering Validation Indices
Hassan, Bryar A., Tayfor, Noor Bahjat, Hassan, Alla A., Ahmed, Aram M., Rashid, Tarik A., Abdalla, Naz N.
Data clustering involves identifying latent similarities within a dataset and organizing them into clusters or groups. The outcomes of various clustering algorithms differ as they are susceptible to the intrinsic characteristics of the original dataset, including noise and dimensionality. The effectiveness of such clustering procedures directly impacts the homogeneity of clusters, underscoring the significance of evaluating algorithmic outcomes. Consequently, the assessment of clustering quality presents a significant and complex endeavor. A pivotal aspect affecting clustering validation is the cluster validity metric, which aids in determining the optimal number of clusters. The main goal of this study is to comprehensively review and explain the mathematical operation of a broad, though not exhaustive, set of internal and external cluster validity indices, to categorize these indices, and to offer suggestions for future advancement of clustering validation research. In addition, we review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms, such as the evolutionary clustering algorithm star (ECA*). Finally, we suggest a classification framework for examining the functionality of both internal and external clustering validation measures regarding their ideal values, user-friendliness, responsiveness to input data, and appropriateness across various fields. This classification aids researchers in selecting the appropriate clustering validation measure to suit their specific requirements.
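The internal/external distinction the review is organized around can be made concrete with one minimal external measure. The Rand index below is the standard textbook definition (fraction of point pairs on which two labelings agree), not a construct specific to this survey:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    # external validation: fraction of point pairs on which the two
    # labelings agree about same-cluster vs different-cluster membership
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
    return agree / len(pairs)

truth = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]  # the same partition under different label names
print(rand_index(truth, perfect))  # → 1.0
```

Internal indices (Silhouette, Davies-Bouldin, and the others surveyed) replace `labels_true` with geometric structure computed from the data itself, which is exactly why their evaluation is harder.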
A new validity measure for fuzzy c-means clustering
A new cluster validity index is proposed for fuzzy clusters obtained from the fuzzy c-means algorithm. The proposed validity index exploits inter-cluster proximity between fuzzy clusters. Inter-cluster proximity is used to measure the degree of overlap between clusters. A low proximity value refers to well-partitioned clusters. The best fuzzy c-partition is obtained by minimizing inter-cluster proximity with respect to c. Well-known data sets are tested to show the effectiveness and reliability of the proposed index.
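The abstract does not give the proximity formula, so the sketch below uses a made-up but natural overlap measure on fuzzy membership vectors (average of the pointwise minimum of two clusters' memberships) purely to convey the intuition that low proximity means well-partitioned clusters:

```python
def proximity(u_a, u_b):
    # illustrative overlap between two fuzzy clusters: for each point take
    # the smaller of its two memberships, then average over points
    # (an assumption for illustration, not the paper's exact definition)
    return sum(min(a, b) for a, b in zip(u_a, u_b)) / len(u_a)

# memberships of 4 points in two clusters from a fuzzy c-means run (made up)
u1 = [0.9, 0.8, 0.2, 0.1]
u2 = [0.1, 0.2, 0.8, 0.9]
overlapping = [0.5, 0.5, 0.5, 0.5]
print(proximity(u1, u2))           # low value: well-separated partition
print(proximity(u1, overlapping))  # higher value: strong overlap
```

Minimizing such a proximity over candidate values of c then selects the partition whose clusters overlap least, which is the selection rule the abstract describes.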
A Bayesian cluster validity index
Wiroonsri, Nathakhun, Preedasawakul, Onthada
Selecting the number of clusters is one of the key processes when applying clustering algorithms. To fulfill this task, various cluster validity indices (CVIs) have been introduced. Most cluster validity indices are defined to detect the optimal number of clusters hidden in a dataset. However, users sometimes do not want the optimal number of groups but a secondary one that is more reasonable for their applications. This has motivated us to introduce a Bayesian cluster validity index (BCVI) based on existing underlying indices. This index is defined based on either Dirichlet or Generalized Dirichlet priors, which result in the same posterior distribution. Our BCVI is then tested using the Wiroonsri index (WI) and the Wiroonsri-Preedasawakul index (WP) as underlying indices for hard and soft clustering, respectively. We compare their outcomes with the original underlying indices, as well as several more existing CVIs, including the Davies-Bouldin (DB), Starczewski (STR), Xie-Beni (XB), and KWON2 indices. Our proposed BCVI is clearly beneficial in applications where user experience matters, as it lets users specify their expected range for the final number of clusters. This aspect is emphasized by our experiments, which are classified into three different cases. Finally, we present some applications to real-world datasets, including MRI brain tumor images. Our tools will be added to a new version of the recently developed R package ``UniversalCVI''.
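The Bayesian idea — blending an underlying index's evidence with a user-chosen Dirichlet prior over candidate numbers of clusters — can be sketched as below. This is a hedged illustration only: the normalisation, the pseudo-count `n`, and the alpha values are assumptions, not the authors' exact posterior:

```python
def bcvi_scores(index_values, alpha, n=10):
    # hedged sketch: treat normalised index values like observed proportions
    # and combine them with Dirichlet prior parameters alpha; the posterior
    # mean of a Dirichlet-multinomial model is proportional to alpha_k + n*r_k
    total = sum(index_values)
    r = [v / total for v in index_values]
    post = [a + n * rk for a, rk in zip(alpha, r)]
    s = sum(post)
    return [p / s for p in post]

# underlying CVI values for k = 2..6 (made-up numbers): k = 4 looks best
cvi = [0.2, 0.5, 0.9, 0.4, 0.1]
flat_prior = [1, 1, 1, 1, 1]    # no preference over k
prefer_small = [8, 4, 1, 1, 1]  # user expects few clusters
print(max(range(5), key=lambda i: bcvi_scores(cvi, flat_prior)[i]) + 2)    # → 4
print(max(range(5), key=lambda i: bcvi_scores(cvi, prefer_small)[i]) + 2)  # → 2
```

With a flat prior the data-driven optimum wins; with a prior concentrated on small k, the user's expectation pulls the selection toward k = 2, which is the "secondary option" behaviour the abstract emphasizes.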
A correlation-based fuzzy cluster validity index with secondary options detector
Wiroonsri, Nathakhun, Preedasawakul, Onthada
The optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. However, in some situations, there is more than one option that can be chosen as the final number of clusters. This aspect has been overlooked by most of the existing works in this area. In this study, we introduce a correlation-based fuzzy cluster validity index known as the Wiroonsri-Preedasawakul (WP) index. This index is defined based on the correlation between the actual distance between a pair of data points and the distance between adjusted centroids with respect to that pair. We evaluate and compare the performance of our index with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. We conduct this evaluation on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets, using the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not all, of these indexes in terms of accurately detecting the optimal number of clusters and providing accurate secondary options. Moreover, our index remains effective even when the fuzziness parameter $m$ is set to a large value. Our R package called UniversalCVI used in this work is available at https://CRAN.R-project.org/package=UniversalCVI.
Are Cluster Validity Measures (In)valid?
Gagolewski, Marek, Bartoszuk, Maciej, Cena, Anna
Internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping with regards to, say, the Silhouette index really meaningful? It turns out that many cluster (in)validity indices promote clusterings that match expert knowledge quite poorly. We also introduce a new, well-performing variant of the Dunn index that is built upon OWA operators and the near-neighbour graph so that subspaces of higher density, regardless of their shapes, can be separated from each other better.
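The paper's new variant builds on the classic Dunn index, whose standard definition (smallest between-cluster distance over largest within-cluster diameter) is shown below; the OWA-operator and near-neighbour-graph refinements are the paper's contribution and are not reproduced here:

```python
import math
from itertools import combinations

def dunn(clusters):
    # classic Dunn index: smallest between-cluster point distance divided by
    # the largest within-cluster diameter (the paper's variant replaces these
    # hard min/max aggregations with OWA operators on a near-neighbour graph)
    inter = min(math.dist(p, q)
                for a, b in combinations(clusters, 2) for p in a for q in b)
    intra = max(math.dist(p, q)
                for c in clusters for p, q in combinations(c, 2))
    return inter / intra

compact = [[(0, 0), (0, 1)], [(9, 9), (9, 10)]]
loose   = [[(0, 0), (0, 5)], [(5, 5), (9, 10)]]
print(dunn(compact) > dunn(loose))  # → True: higher Dunn favours the compact split
```

Because the classic index depends on single extreme distances, one outlier can dominate it; averaging aggregations such as OWA operators address exactly that sensitivity.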
Clustering performance analysis using new correlation based cluster validity indices
There are various cluster validity measures used for evaluating clustering results. One of the main objectives of using these measures is to seek the unknown optimal number of clusters. Some measures work well for clusters with different densities, sizes, and shapes. Yet one weakness those validity measures share is that they sometimes provide only one clear optimal number of clusters. That number is actually unknown, and there may be more than one potential sub-optimal option that a user may wish to choose depending on the application. We develop two new cluster validity indices based on the correlation between the actual distance between a pair of data points and the centroid distance of the clusters in which the two points are located. Our proposed indices consistently yield several peaks at different numbers of clusters, which overcomes the weakness stated above. Furthermore, the introduced correlation can also be used for evaluating the quality of a selected clustering result. Several experiments in different scenarios, including the well-known iris data set and a real-world marketing application, have been conducted to compare the proposed validity indices with several well-known ones.
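The correlation at the heart of these indices (and of the related WP index above) can be sketched for hard clusterings as follows. This is a simplified assumption-laden illustration: plain centroids and plain Pearson correlation, without the adjustments the papers define:

```python
import math
from itertools import combinations

def pearson(x, y):
    # Pearson correlation coefficient of two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_score(X, labels):
    # sketch of the correlation idea: compare each pair's actual distance
    # with the distance between the centroids of the clusters the two points
    # belong to (plain centroids here; the papers use adjusted centroids)
    cent = {}
    for k in set(labels):
        pts = [x for x, l in zip(X, labels) if l == k]
        cent[k] = [sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0]))]
    actual, model = [], []
    for i, j in combinations(range(len(X)), 2):
        actual.append(math.dist(X[i], X[j]))
        model.append(math.dist(cent[labels[i]], cent[labels[j]]))
    return pearson(actual, model)

X = [[0, 0], [0.3, 0.1], [5, 5], [5.2, 4.9], [10, 0], [9.8, 0.2]]
good = [0, 0, 1, 1, 2, 2]
bad  = [0, 1, 0, 1, 2, 2]
print(correlation_score(X, good) > correlation_score(X, bad))  # → True
```

A good partition makes centroid distances track actual pairwise distances closely, so the correlation is high; scanning this score over candidate numbers of clusters produces the multiple peaks the abstract describes.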
A Simplified Framework for Air Route Clustering Based on ADS-B Data
Duong, Quan, Tran, Tan, Pham, Duc-Thinh, Mai, An
The volume of flight traffic increases over time, which makes strategic traffic flow management a challenging problem, since modeling the entire traffic data requires substantial computational resources. On the other hand, Automatic Dependent Surveillance - Broadcast (ADS-B) technology has been considered a promising data technology for safely and efficiently providing both flight crews and ground control staff with the necessary information about the position and velocity of airplanes in a specific area. To tackle this problem, we present in this paper a simplified framework that can help detect the typical air routes between airports based on ADS-B data. Specifically, the flight traffic is classified into major groups based on similarity measures, which helps to reduce the number of flight paths between airports. Our framework can thus be used to reduce the practical computational cost of air flow optimization and to evaluate operational performance. Finally, to illustrate the potential applications of our proposed framework, an experiment was performed using ADS-B traffic flight data for three different pairs of airports. The typical routes detected between each pair of airports show promising results, obtained by combining two indices for measuring clustering performance and incorporating human judgment through visual inspection.
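The grouping step — collapsing many flight paths between an airport pair into a few typical routes by similarity — can be sketched with a toy greedy scheme. The distance function (average pointwise distance of equally-sampled trajectories) and the threshold grouping are illustrative assumptions, not the paper's pipeline:

```python
import math

def path_distance(p, q):
    # average pointwise distance between two equally-sampled trajectories
    return sum(math.dist(a, b) for a, b in zip(p, q)) / len(p)

def group_routes(paths, threshold):
    # greedy illustration: assign each flight path to the first existing
    # group whose representative is close enough, else open a new group
    # (the paper instead combines similarity measures with validity indices
    # and visual inspection to pick the grouping)
    reps, groups = [], []
    for p in paths:
        for i, r in enumerate(reps):
            if path_distance(p, r) <= threshold:
                groups[i].append(p)
                break
        else:
            reps.append(p)
            groups.append([p])
    return groups

# three synthetic "flights" between the same airports; two share a route
direct_a = [(0, 0), (5, 0.2), (10, 0)]
direct_b = [(0, 0), (5, -0.1), (10, 0)]
detour   = [(0, 0), (5, 4.0), (10, 0)]
print(len(group_routes([direct_a, direct_b, detour], threshold=1.0)))  # → 2
```

Each group's representative then serves as one typical route, which is what lets downstream flow-optimization work with a handful of routes instead of every individual flight path.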
An Internal Cluster Validity Index Based on Distance-based Separability Measure
Evaluating clustering results is a significant part of cluster analysis. Since clustering is a typical unsupervised learning task, there are usually no true class labels, so a number of internal evaluation measures, which use only predicted labels and the data, have been created. They are also called internal cluster validity indices (CVIs). Without true labels, designing an effective CVI is not simple, because it is as hard as creating a clustering method. Having more CVIs is crucial, because there is no universal CVI that can measure all datasets and no specific method for selecting a proper CVI for clusters without true labels. Therefore, applying more CVIs to evaluate clustering results is necessary. In this paper, we propose a novel CVI, called the Distance-based Separability Index (DSI), based on a data separability measure. We applied the DSI and eight other internal CVIs, from early studies such as Dunn (1974) to the most recent such as CVDD (2019), as a comparison. We used an external CVI as ground truth for the clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. The results show that the DSI is an effective, unique, and competitive CVI compared with the other CVIs. In addition, we summarize the general process for evaluating CVIs and create a new method, rank difference, to compare the results of CVIs.
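One natural way to make a "distance-based separability measure" concrete is to compare the distribution of within-cluster distances against between-cluster distances; the sketch below does this with a two-sample Kolmogorov-Smirnov statistic. This is a hedged illustration of the general idea, not necessarily the paper's exact DSI:

```python
import math
from itertools import combinations

def ks_statistic(xs, ys):
    # two-sample Kolmogorov-Smirnov statistic: largest gap between the
    # empirical CDFs of the two samples
    def cdf(sample, v):
        return sum(1 for t in sample if t <= v) / len(sample)
    grid = sorted(set(xs) | set(ys))
    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in grid)

def separability(data, labels):
    # illustrative separability score: well-separated clusters make the
    # within-cluster and between-cluster distance distributions differ
    # sharply, driving the KS statistic toward 1
    within, between = [], []
    for i, j in combinations(range(len(data)), 2):
        d = math.dist(data[i], data[j])
        (within if labels[i] == labels[j] else between).append(d)
    return ks_statistic(within, between)

data = [(0, 0), (0.2, 0.1), (0.1, 0.3), (6, 6), (6.1, 5.8), (5.9, 6.2)]
good  = [0, 0, 0, 1, 1, 1]
mixed = [0, 1, 0, 1, 0, 1]
print(separability(data, good) > separability(data, mixed))  # → True
```

Unlike centroid-based indices, a distribution comparison of this kind does not assume any particular cluster shape, which is one reason a separability-based CVI can behave differently from the eight classical indices compared in the paper.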